Introduction

Objective: This project….

Cleaning and Packages

library(nycflights13)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(RColorBrewer)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.1.0     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Loading required package: viridisLite

Visual analysis

Airline Delays

Is there an airline that is more on time than others?

The first plot shows the average delay of each airlines from their destination. The highest average delays are from Frontier airlines. It seems from this graph that the lowest average delays are in the Alaska Airlines airline, in fact it seems they ahve a defecit of delays meaning they had either early or on time in their total. This makes Alaska airlines appear to be the most timely airline compared with others, however more analysis needs to be done here to assess the time delays as there are more flights recorded for certain airlines and aparant outliers which may scew the average.

#This is to get the total delay which will be used to see what airline has the most and least delays:
flights<- mutate(flights, totaldelay = arr_delay + dep_delay)

# Adding names of the carrier and changing the column name for clarity:
flights<- flights %>% left_join(airlines, by = c('carrier' = 'carrier'))
colnames(flights)[21]<- "CarrierName"

delays<-flights %>% group_by(CarrierName) %>% 
  summarise(Average_Delay = mean(totaldelay, na.rm =
TRUE)) %>% 
  ggplot(aes(reorder(CarrierName,Average_Delay), Average_Delay, fill= CarrierName)) + geom_bar(stat ='identity', color = rainbow(16)) + scale_x_discrete(guide = guide_axis(angle = 90)) 
## `summarise()` ungrouping output (override with `.groups` argument)
delays<-delays + theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.line = element_line(colour = "black"))+ 
  ggtitle("Average Delay by Airline") +
  xlab("Airline") + ylab("Average Delay")


delays

Here is a list of the departure delays starting with the carrier that has the smallest average delay - Alaska Airlines Inc. with an average delay of -4.1 hours. This shows us that Alaska Airlines appears to be more on time than the other airlines in our data set.

top16 <- flights %>%
  group_by(CarrierName) %>%
  summarize(avg_delay = mean(totaldelay, na.rm = TRUE)) %>%
  arrange(avg_delay) %>%
  top_n(n = 16)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by avg_delay
top16
## # A tibble: 16 x 2
##    CarrierName                 avg_delay
##    <chr>                           <dbl>
##  1 Alaska Airlines Inc.            -4.10
##  2 Hawaiian Airlines Inc.          -2.01
##  3 US Airways Inc.                  5.87
##  4 American Airlines Inc.           8.93
##  5 Delta Air Lines Inc.            10.9 
##  6 Virgin America                  14.5 
##  7 United Air Lines Inc.           15.6 
##  8 Envoy Air                       21.2 
##  9 JetBlue Airways                 22.4 
## 10 Endeavor Air Inc.               23.8 
## 11 SkyWest Airlines Inc.           24.5 
## 12 Southwest Airlines Co.          27.3 
## 13 Mesa Airlines Inc.              34.5 
## 14 ExpressJet Airlines Inc.        35.6 
## 15 AirTran Airways Corporation     38.7 
## 16 Frontier Airlines Inc.          42.1

Because there are apparent outliers for some delays within each carrier, it may be helpful to look at a certain threshold; where the total delay is more than an hour, this would constitute airlines that are late departing and late arriving. This graph shows a different outcome than before, while Frontier airlines seems to have the highest average delays, it has one of the lowest flights with delays that are more than an hour. Alaska Airlines has 56 flights that had over an hour delay, compared to ExpressJet Airlines with a whopping 11503 flights with over an hour delay.

hourdelay<-flights %>% filter(totaldelay > 60) %>% count(CarrierName, sort =TRUE) %>% 
  mutate(CarrierName = factor(CarrierName, levels = CarrierName, ordered =TRUE)) %>%
ggplot(aes(CarrierName, n, fill = CarrierName)) + geom_bar(stat ='identity', color = rainbow(16)) + scale_x_discrete(guide = guide_axis(angle = 90)) 

hourdelay<-hourdelay + theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.line = element_line(colour = "black"))+ 
  ggtitle("Delay by More Than an Hour") +
  xlab("Airline") + ylab("Count")
hourdelay

Airport Delays?

This map looks at the airports that have the most delays on arrival, the larger dot represents a bigger delay (in minutes).

library(albersusa)
m<-usa_sf()
library("maps")
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
states <- cbind(states, st_coordinates(st_centroid(states)))
## Warning in st_centroid.sf(states): st_centroid assumes attributes are constant
## over geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
library("tools")
states$ID <- toTitleCase(states$ID)


flights1 <- flights %>% group_by(dest) %>% 
  summarise(avg_delay = mean(totaldelay, na.rm = T)) %>% 
  left_join(airports, c("dest"="faa")) %>%
  arrange(desc(avg_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
head(flights1)
## # A tibble: 6 x 9
##   dest  avg_delay name                lat    lon   alt    tz dst   tzone        
##   <chr>     <dbl> <chr>             <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
## 1 CAE        75.6 Columbia Metropo…  33.9  -81.1   236    -5 A     America/New_…
## 2 TUL        68.5 Tulsa Intl         36.2  -95.9   677    -6 A     America/Chic…
## 3 OKC        59.8 Will Rogers World  35.4  -97.6  1295    -6 A     America/Chic…
## 4 JAC        55.6 Jackson Hole Air…  43.6 -111.   6451    -7 A     America/Denv…
## 5 TYS        52.5 Mc Ghee Tyson      35.8  -84.0   981    -5 A     America/New_…
## 6 BHM        45.9 Birmingham Intl    33.6  -86.8   644    -6 A     America/Chic…
map1<-ggplot(data = m) +
    geom_sf(fill = "chartreuse1",alpha = 0.45) +
    geom_sf(data = states, alpha = 0.25, color = "darkgreen", size = 0.3) + 
    geom_text(data = states, aes(X, Y, label = ID),check_overlap = TRUE,fontface = "bold", size = 2) +
  geom_point(data = flights1, aes(x = lon, y = lat, size = avg_delay, fill = name), color = "purple", alpha = 0.7)+ guides(fill = FALSE) + scale_size(range = c(0, 5), name="Average Delay in Minutes") +
  coord_sf(
    xlim = c(-130, -70),
    ylim = c(20, 50)
  )


map1<-map1 + theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_rect(colour= "aliceblue", fill = "aliceblue",
                                size = 0.5, linetype = "solid"),
  axis.line = element_line(colour = "black"))+ 
  ggtitle("Average Delays by Airport") +
  xlab("Lat") + ylab("Long")
map1
## Warning: Removed 5 rows containing missing values (geom_point).

ggplotly(hide_legend(map1))
top10<- flights1 %>%
  group_by(name) %>%
  arrange(avg_delay) %>%
  top_n(n = 10)
## Selecting by tzone
top10
## # A tibble: 101 x 9
## # Groups:   name [101]
##    dest  avg_delay name                lat    lon   alt    tz dst   tzone       
##    <chr>     <dbl> <chr>             <dbl>  <dbl> <dbl> <dbl> <chr> <chr>       
##  1 LEX      -31    "Blue Grass"       38.0  -84.6   979    -5 A     America/New…
##  2 PSP      -15.7  "Palm Springs In…  33.8 -117.    477    -8 A     America/Los…
##  3 SNA       -1.09 "John Wayne Arpt…  33.7 -118.     56    -8 A     America/Los…
##  4 MVY        6.60 "Martha\\\\'s Vi…  41.4  -70.6    67    -5 A     America/New…
##  5 HNL        7.95 "Honolulu Intl"    21.3 -158.     13   -10 N     Pacific/Hon…
##  6 DFW        8.93 "Dallas Fort Wor…  32.9  -97.0   607    -6 A     America/Chi…
##  7 MIA        9.17 "Miami Intl"       25.8  -80.3     8    -5 A     America/New…
##  8 SLC        9.20 "Salt Lake City …  40.8 -112.   4227    -7 A     America/Den…
##  9 SEA        9.50 "Seattle Tacoma …  47.4 -122.    433    -8 A     America/Los…
## 10 LAS        9.63 "Mc Carran Intl"   36.1 -115.   2141    -8 A     America/Los…
## # … with 91 more rows

Potential delay causes

#check the time-zones flown to
airports %>% group_by(tzone) %>% summarise(n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
##    tzone                   n
##    <chr>               <int>
##  1 America/Anchorage     239
##  2 America/Chicago       342
##  3 America/Denver        119
##  4 America/Los_Angeles   176
##  5 America/New_York      519
##  6 America/Phoenix        38
##  7 America/Vancouver       2
##  8 Asia/Chongqing          2
##  9 Pacific/Honolulu       18
## 10 <NA>                    3
flightsx <- left_join(flights,weather) %>%
  dplyr::select(day, month,dep_delay, arr_delay, carrier,origin, dest, air_time, distance ,time_hour, temp:visib) %>%
  subset(select= -(wind_gust)) #rm wind_gust cause missing too many values
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
# add time_status and duration columns
flightsx$time_status <- NA
flightsx1 <- flightsx %>% filter(dep_delay > 15) %>% mutate(time_status="delayed")
flightsx2 <- flightsx %>% filter(dep_delay <= 15) %>% mutate(time_status="on_time")
flightsx3<-rbind(flightsx1,flightsx2)
flightsx3 <- flightsx3 %>% arrange(day,month,time_hour)
flightsx3
## # A tibble: 328,521 x 19
##      day month dep_delay arr_delay carrier origin dest  air_time distance
##    <int> <int>     <dbl>     <dbl> <chr>   <chr>  <chr>    <dbl>    <dbl>
##  1     1     1         2        11 UA      EWR    IAH        227     1400
##  2     1     1         4        20 UA      LGA    IAH        227     1416
##  3     1     1         2        33 AA      JFK    MIA        160     1089
##  4     1     1        -1       -18 B6      JFK    BQN        183     1576
##  5     1     1        -4        12 UA      EWR    ORD        150      719
##  6     1     1         0        -4 B6      JFK    BOS         44      187
##  7     1     1        24        12 EV      EWR    IAD         52      212
##  8     1     1        47        30 UA      EWR    MCO        145      937
##  9     1     1       101       137 MQ      LGA    CLT        118      544
## 10     1     1        -6       -25 DL      LGA    ATL        116      762
## # … with 328,511 more rows, and 10 more variables: time_hour <dttm>,
## #   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_status <chr>
flightsx3$duration <- NA
flightsx12 <- flightsx3 %>% filter(air_time < 180) %>% mutate(duration="short haul")
flightsx22 <- flightsx3 %>% filter(air_time >= 180 & air_time < 360) %>% mutate(duration="medium haul")
flightsx32 <- flightsx3 %>% filter(air_time >= 360) %>% mutate(duration="long haul")
flightsx4<-rbind(flightsx12,flightsx22,flightsx32)
flightsx4
## # A tibble: 327,346 x 20
##      day month dep_delay arr_delay carrier origin dest  air_time distance
##    <int> <int>     <dbl>     <dbl> <chr>   <chr>  <chr>    <dbl>    <dbl>
##  1     1     1         2        33 AA      JFK    MIA        160     1089
##  2     1     1        -4        12 UA      EWR    ORD        150      719
##  3     1     1         0        -4 B6      JFK    BOS         44      187
##  4     1     1        24        12 EV      EWR    IAD         52      212
##  5     1     1        47        30 UA      EWR    MCO        145      937
##  6     1     1       101       137 MQ      LGA    CLT        118      544
##  7     1     1        -6       -25 DL      LGA    ATL        116      762
##  8     1     1        -5        19 B6      EWR    FLL        158     1065
##  9     1     1        -3       -14 EV      LGA    IAD         53      229
## 10     1     1        -3        -8 B6      JFK    MCO        140      944
## # … with 327,336 more rows, and 11 more variables: time_hour <dttm>,
## #   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_status <chr>,
## #   duration <chr>

Check which NY airport has best delays record

addmargins(table(flightsx3$origin,flightsx3$time_status))
##      
##       delayed on_time    Sum
##   EWR   28942   88654 117596
##   JFK   22650   86766 109416
##   LGA   19182   82327 101509
##   Sum   70774  257747 328521
airports1 <- c("EWR","JFK","LGA")
delayed <- c(24.6,20.7,18.9)
on_time <- c(75.4,79.3,81.1)
df <-data.frame(airports1,delayed,on_time)
df
##   airports1 delayed on_time
## 1       EWR    24.6    75.4
## 2       JFK    20.7    79.3
## 3       LGA    18.9    81.1
ggplot(data = df, aes(x = airports1,y=delayed,fill=airports1)) +
  geom_col() +
  ggtitle("Percentage of flights delayed")

ggplot(data = df, aes(x = airports1,y=on_time,fill=airports1)) +
  geom_col() +
  ggtitle("Percentage of flights on time")

# Newark has the most flights but also has a much higher % of delayed flights as
#displayed in df above

Look for any possible contributing factors towards these delays: So by grouping by the three Origin airports for New York (Newark, JFK and La Guardia) and calculating the average values for the weather data it is possible to test correlations between these values and determine contributing weather factors towards departure delays from these airports. Also calculated is the correlation between the number of flights leaving each airport with the departure delays.

#first instinct is to check weather
weather <- nycflights13::weather
df<- weather %>% left_join(flights) %>%
  filter_at(vars(dep_delay,temp,humid,wind_speed,precip,pressure,visib), all_vars(!is.na(.))) %>%
  group_by(origin) %>%
  summarise(avg_dep_delay = mean(dep_delay),avg_temp = mean(temp), avg_humid = mean(humid), avg_wind = mean(wind_speed),
            avg_precip = mean(precip), avg_pressure = mean(pressure), avg_visib = mean(visib), n=n())
## Joining, by = c("origin", "year", "month", "day", "hour", "time_hour")
## `summarise()` ungrouping output (override with `.groups` argument)
#need to figure a way to relate these data, it seems that weather doesnt seem to have as much effect and it is actually more to do with trafic flow on the data....

library(corrplot)
## corrplot 0.84 loaded
cors <- cor(df[,-1])
corrplot(cors, type="upper")

When comparing mean departure delays for the three airports it can be seen from the dataframe above that the ‘worst’ NY airport for departure delays is Newark with an average of ~13.5 minutes. Interestingly though mean weather values vary between the three airports, meaning there is no particular airport with noticeably more adverse weather conditions (to be expected since these are all within ~20km of each other). From the correlation plot it can be seen that a very strong correlation (~0.979) is calculated for departure delays and number of flights (n). And from the dataframe it can be seen that indeed the departure delays decrease across the three airports with decreasing number of flights. Thus this will be investigated further:

Below the percentage of flights that are on-time/delayed is calculated and displayed

df<-flightsx4 %>% group_by(duration, time_status) %>%
  summarise(n=n())
## `summarise()` regrouping output by 'duration' (override with `.groups` argument)
#from the dataframe, the percentages can be calculated as:
perc <- c(17.8,82.2,19.7,80.3,22.2,77.8)
#thus plotted
df$percentage_of_flights <- perc
ggplot(data=df,aes(x=time_status,y=percentage_of_flights,fill=percentage_of_flights)) +
  geom_col() +
  facet_wrap(~duration) +
  ggtitle("Percentage of flights delayed/on-time by flight duration")

In order to determine if excessive flight numbers have an effect on delays below average delays by month is calculated and compared with flights numbers by month, these are split by the three NY airports

data <- flights %>% 
  dplyr::select(origin, month, day ,arr_delay, dep_delay) %>%   
  group_by(origin, month) %>% 
  summarise(avg_delay =  mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(-avg_delay)
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
ggplot(data, aes(x=month, y=avg_delay)) +
  geom_point(aes(color = origin)) + xlab("Month") + 
  ylab("Average Delay") + geom_smooth(method=loess,color = "Red") +
  ggtitle("NY airports average delays by month") +
  scale_x_continuous(breaks=1:12, labels=c("Jan","Feb","Mar","Apr","May","Jun",
                                           "Jul","Aug","Sept","Oct","Nov","Dec"))
## `geom_smooth()` using formula 'y ~ x'

data1 <- flights %>% group_by(origin, month)  %>%
  summarise(n=n()) %>%
  ungroup() %>%
  arrange(-n)
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
ggplot(data1, aes(x=month, y=n)) +
  geom_point(aes(color=origin)) +
  geom_smooth() +
  ggtitle("Average number of flights per month for the three NY airports") +
  scale_x_continuous(breaks=1:12, labels=c("Jan","Feb","Mar","Apr","May","Jun",
                                           "Jul","Aug","Sept","Oct","Nov","Dec"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As seen above the delays are biggest in June and July, which is the largest period also for flights leaving the three airports. The decrease in delays early in 2013 matches the lower flights numbers, as does the sharp decrease in average delays starting in August to Oct/Nov, then there is a sharp increase in Delays for December which goes against the trend slightly.

Carrier analysis

library(nycflights13)
library(dplyr)
library(ggplot2)

flights %>%
group_by(carrier) %>%
summarise(ave_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(ave_delay) %>%
mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')
## `summarise()` ungrouping output (override with `.groups` argument)

flights <- flights %>% 
        mutate(ontime = arr_delay < 5)
flights <- flights %>%
  mutate(arr_type = ifelse(arr_delay < 5, "on time", "delayed"))

flights %>% group_by(carrier) %>% summarise(ontime_prop = sum(ontime == TRUE) / n()) %>% 
arrange(desc(ontime_prop))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 16 x 2
##    carrier ontime_prop
##    <chr>         <dbl>
##  1 HA            0.772
##  2 9E           NA    
##  3 AA           NA    
##  4 AS           NA    
##  5 B6           NA    
##  6 DL           NA    
##  7 EV           NA    
##  8 F9           NA    
##  9 FL           NA    
## 10 MQ           NA    
## 11 OO           NA    
## 12 UA           NA    
## 13 US           NA    
## 14 VX           NA    
## 15 WN           NA    
## 16 YV           NA
ggplot(data = flights, aes(x = carrier , fill = arr_type)) + labs(title = 'Total count of delayed and on time fligths by carrier', subtitle = 'On time if arr_delay < 5mins') + geom_bar(position = position_dodge(preserve = "single"))

Weather analysis

Conclusion